Clustering data with measurement errors
Abstract
Traditional clustering methods assume that there is no measurement error, or uncertainty, associated with data. Often, however, real-world applications require treatment of data that have such errors. In the presence of measurement errors, well-known clustering methods like k-means and hierarchical clustering may not produce satisfactory results. The fundamental question addressed in this paper is: “What is an appropriate clustering method in the presence of errors associated with data?” In the first part of this paper, we develop a statistical model and algorithms for clustering data in the presence of errors. We assume that the errors associated with data follow a multivariate Gaussian distribution and are independent between data points. The model uses the maximum likelihood principle and provides us with a new metric for clustering. This metric is used to develop two algorithms for error-based clustering, hError and kError, that are generalizations of Ward’s hierarchical and k-means clustering algorithms, respectively. In the second part of the paper, we discuss sets of clustering problems where error information associated with the data to be clustered is readily available and where error-based clustering is likely to be superior to clustering methods that ignore error. We give examples of the effectiveness of error-based clustering on data generated from the following statistical models: (1) sample averaging, (2) multiple linear regression, (3) ARIMA time series, and (4) Markov chain models. We present theoretical and empirical justifications for the value of error-based clustering on these classes of problems.
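To make the idea of a k-means generalization with Gaussian errors concrete, the following is a minimal sketch and not the paper's kError algorithm itself, which the abstract does not spell out. It assumes each point comes with a known error covariance matrix, estimates each cluster center as the inverse-covariance weighted mean (the maximum likelihood estimate under independent Gaussian errors), and assigns each point to the center with the smallest error-scaled squared distance. The function names `weighted_center` and `error_weighted_kmeans`, and all parameters, are hypothetical.

```python
import numpy as np


def weighted_center(points, precisions):
    """Inverse-covariance weighted mean of a set of points.

    points:     array of shape (n, d)
    precisions: array of shape (n, d, d), inverse error covariance per point
    Under independent Gaussian errors this is the maximum likelihood
    estimate of a common underlying mean.
    """
    total_precision = precisions.sum(axis=0)                    # (d, d)
    weighted_sum = np.einsum("nij,nj->i", precisions, points)   # (d,)
    return np.linalg.solve(total_precision, weighted_sum)


def error_weighted_kmeans(points, covariances, k, n_iter=50, seed=0):
    """k-means-style loop using per-point error covariances (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = len(points)
    precisions = np.linalg.inv(covariances)                     # (n, d, d)
    # Assumption: initialize centers from k distinct data points.
    centers = points[rng.choice(n, size=k, replace=False)].copy()
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Squared distance of point i to center j, scaled by point i's precision.
        diffs = points[:, None, :] - centers[None, :, :]        # (n, k, d)
        dists = np.einsum("nkd,nde,nke->nk", diffs, precisions, diffs)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = labels == j
            if members.any():  # keep the old center if a cluster empties
                centers[j] = weighted_center(points[members], precisions[members])
    return labels, centers
```

With identical covariances for all points, this sketch reduces to ordinary k-means, since the weighted mean becomes the arithmetic mean and the scaled distance becomes a fixed Mahalanobis distance. Points with large error covariances pull the center less than precisely measured points.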
Journal: Computational Statistics & Data Analysis
Volume 51
Published 2007